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Abstract — For the purposes of foreground estimation, the true 
background model is unavailable in many practical circumstances 
and needs to be estimated from cluttered image sequences. We 
propose a sequential technique for static background estima- 
tion in such conditions, with low computational and memory 
requirements. Image sequences are analysed on a block-by- 
block basis. For each block location a representative set is 
maintained which contains distinct blocks obtained along its 
temporal line. The background estimation is carried out in a 
Markov Random Field framework, where the optimal labelling 
solution is computed using iterated conditional modes. The 
clique potentials are computed based on the combined frequency 
response of the candidate block and its neighbourhood. It is 
assumed that the most appropriate block results in the smoothest 
response, indirectly enforcing the spatial continuity of structures 
within a scene. Experiments on real-life surveillance videos 
demonstrate that the proposed method obtains considerably bet- 
ter background estimates (both qualitatively and quantitatively) 
than median filtering and the recently proposed "intervals of 
stable intensity" method. Further experiments on the Wallflower 
dataset suggest that the combination of the proposed method 
with a foreground segmentation algorithm results in improved 
foreground segmentation. 

I. Introduction 

Intelligent surveillance systems can be used effectively for 
monitoring critical infrastructure such as banks, airports and 
railway stations [1]. Some of the key tasks of these systems are 
real-time segmentation, tracking and analysis of foreground 
objects of interest [2], [3]. Many approaches for detecting and 
tracking objects are based on background subtraction tech- 
niques, where each frame is compared against a background 
model for foreground object detection. 

The majority of background subtraction methods adaptively 
model and update the background for every new input frame. 
Surveys on this class of algorithms are found in [4], [5]. 
However, most methods presume the training image sequence 
used to model the background is free from foreground objects 
[6], [7], [8]. This assumption is often not true in the case of 
uncontrolled environments such as train stations and airports, 
where directly obtaining a clear background is almost impos- 
sible. Furthermore, in certain situations a strong illumination 
change can render the existing background model ineffective, 
thereby forcing us to compute a new background model. 
In such circumstances, it becomes inevitable to estimate the 
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C++ source code available at http://arma.sourceforge.net/background_est/ 





Fig. 1. Typical example of estimating the background from an 
cluttered image sequence: (i) input frames cluttered with foreground 
objects, where only parts of the background are visible; (ii) estimated 
background. 

background using cluttered sequences (i.e. where parts of the 
background are occluded). A good background estimate will 
complement the succeeding background subtraction process, 
which can result in improved detection of foreground objects. 

The problem can be paraphrased as follows: given a short 
image sequence captured from a stationary camera in which 
the background is occluded by foreground objects in every 
frame of the sequence for most of the time, the aim is to 
estimate its background, as illustrated in Figure 1. This prob- 
lem is also known in the literature as background initialisation 
or bootstrapping [9]. Background estimation is related to, but 
distinct from, background modelling. Owing to the complex 
nature of the problem, we confine our estimation strategy 
to static backgrounds (e.g. no waving trees), which is quite 
common in urban surveillance environments such as banks, 
shopping malls, airports and train stations. 

Existing background estimation techniques, such as simple 
median filtering, typically require the storage of all the input 
frames in memory before estimating the background. This 
increases memory requirements immensely. In this paper we 
propose a robust background estimation algorithm in a Markov 
Random Field (MRF) framework. It operates on the input 
frames sequentially, avoiding the need to store all the frames. 
It is also computationally less intensive, enabling the system to 
achieve real-time performance — this aspect is critical in video 
surveillance applications. This paper is a thoroughly revised 
and extended version of our previous work [10]. 

We continue as follows. Section II gives an overview 
of existing methods for background estimation. Section III 
describes the proposed algorithm in detail. Results from exper- 
iments on real-life surveillance videos are given in Section IV, 
followed by the main findings in Section V. 



II. Previous Work 

Existing methods to address the cluttered background esti- 
mation problem can be broadly classified into three categories: 
(i) pixel-level processing, (ii) region-level processing, (iii) a 
hybrid of the first two. It must be noted that all methods 
assume the background to be static. The three categories are 
overviewed in the sections below. 

A. Pixel-level Processing 

In the first category the simplest techniques are based on 
applying a median filter on pixels at each location across all 
the frames. Lo and Velastin [11] apply this method to obtain 
reference background for detecting congestion on underground 
train platforms. However, its limitation is that the background 
is estimated correctly only if it is exposed for more than 50% 
of the time. Long and Yang [12] propose an algorithm that 
finds pixel intervals of stable intensity in the image sequence, 
then heuristically chooses the value of the longest stable inter- 
val to most likely represent the background. Bevilacqua [13] 
applies B ayes' theorem in his proposed approach. For every 
pixel it estimates the intensity value to which that pixel has 
the maximum posterior probability. 

Wang and Suter [14] employ a two-staged approach. The 
first stage is similar to that of [12], followed by choosing 
background pixel values whose interval maximises an objec- 
tive function. It is defined as Ni k /Si k where Ni k and Si k 
are the length and standard variance of the k-th interval of 
pixel sequence I. The method proposed by Kim et al. [15] 
quantises the temporal values of each pixel into distinct bins 
called codewords. For each codeword, it keeps a record of the 
maximum time interval during which it has not recurred. If this 
time period is greater than N/2, where N is the total number 
of frames in the sequence, the corresponding codeword is 
discarded as foreground pixel. The system recently proposed 
by Chiu et al. [16] estimates the background and utilises it for 
object segmentation. Pixels obtained from each location along 
its time axis are clustered based on a threshold. The pixel 
corresponding to the cluster having the maximum probability 
and greater than a time-varying threshold is extracted as 
background pixel. 

All these pixel based techniques can perform well when the 
foreground objects are moving, but are likely to fail when the 
time interval of exposure of the background is less than that 
of the foreground. 

B. Region-level Processing 

In the second category, the method proposed by Farin et 
al. [17] performs a rough segmentation of input frames into 
foreground and background regions. To achieve this, each 
frame is divided into blocks, the temporal sum of absolute 
differences (SAD) of the co-located blocks is calculated, and 
a block similarity matrix is formed. The matrix elements that 
correspond to small SAD values are considered as stationary 
elements and high SAD values correspond to non- stationary 
elements. A median filter is applied only on the blocks 
classified as background. The algorithm works well in most 



scenarios, however, the spatial correlation of a given block 
with its neighbouring blocks already filled by background is 
not exploited, which can result in estimation errors if the 
objects are quasi- stationary for extended periods. 

In the method proposed by Colombari et al. [18], each frame 
is divided into blocks of size N x N overlapping by 50% 
in both dimensions. These blocks are clustered using single 
linkage agglomerative clustering along their time-line. In the 
following step the background is built iteratively by selecting 
the best continuation block for the current background using 
the principles of visual grouping. The spatial correlations that 
naturally exist within small regions of the background image 
are considered during the estimation process. The algorithm 
can have problems with blending of the foreground and 
background due to slow moving or quasi- stationary objects. 
Furthermore, the algorithm is unlikely to achieve real-time 
performance due to its complexity. 

C. Hybrid Approaches 

In the third category, the algorithm presented by Gutchess et 
al. [19] has two stages. The first stage is similar to that of [12], 
with the second stage estimating the likelihood of background 
visibility by computing the optical flow of blocks between 
successive frames. The motion information helps classify an 
intensity transition as background to foreground or vice versa. 
The results are typically good, but the usage of optical flow 
for each pixel makes it computationally intensive. 

In [20], Cohen views the problem of estimating the back- 
ground as an optimal labelling problem. The method defines 
an energy function which is minimised to achieve an optimal 
solution at each pixel location. It consists of data and smooth- 
ness terms. The data term accounts for pixel stationarity and 
motion boundary consistency while the smoothness term looks 
for spatial consistency in the neighbourhood. The function is 
minimised using the a-expansion algorithm [21] with suitable 
modifications. A similar approach with a different energy 
function is proposed by Xu and Huang [22]. The function is 
minimised using loopy belief propagation algorithm. Both so- 
lutions provide robust estimates, however, their main drawback 
is large computational complexity to process a small number 
of input frames. For instance, in [22] the authors report a 
prototype of the algorithm on Matlab takes about 2.5 minutes 
to estimate the background from a set of only 10 images of 
QVGA resolution (320 x 240). 

III. Proposed Algorithm 

We propose a computationally efficient, region-level algo- 
rithm that aims to address the problems described in the 
previous section. It has several additional advantages as well 
as novelties, including: 

• The background estimation problem is recast into an 
MRF scheme, providing a theoretical framework. 

• Unlike the techniques mentioned in Section II, it does 
not expect all frames of the sequence to be stored in 
memory simultaneously — instead, it processes frames 
sequentially, which results in a low memory footprint. 



Fig. 2. (i) Example frame from an image sequence, (ii) partial background initialisation (after Stage 2), (iii) remaining background estimation 
in progress (Stage 3), (iv) estimated background. 



• The formulation of the clique potential in the MRF 
scheme is based on the combined frequency response 
of the candidate block and its neighbourhood. It is 
assumed that the most appropriate configuration results 
in the smoothest response (minimum energy), indirectly 
exploiting the spatial correlations within small regions of 
a scene. 

• Robustness against high frequency image noise. In the 
calculation of the energy potential we compute 2D Dis- 
crete Cosine Transform (DCT) of the clique. The high 
frequency DCT coefficients are ignored in the analysis 
as they typically represent image noise. 

A. Overview of the Algorithm 

In the text below we first provide an overview of the 
proposed algorithm, followed by a detailed description of its 
components (Sections III-B to III-E). It is assumed that at each 
block location: (i) the background is static and is revealed at 
some point in the training sequence for a short interval, and 
(ii) the camera is stationary. The background is estimated by 
recasting it as a labelling problem in an MRF framework. The 
algorithm has three stages. 

Let the resolution of the greyscale image sequence / 
be W x %. In the first stage, the frames are viewed as 
instances of an undirected graph, where the nodes of the 
graph are blocks of size N x N pixels 1 . We denote the 
nodes of the graph by Af(i, j) for i = 0, 1, 2, • • • , (W/N) - 1, 
j = 0, 1, 2, • • • , (Ti/N) - 1. Let If be the f-th frame of the 
training image sequence and let its corresponding node labels 
be denoted by Cf(i,j), and / = 1,2, where F is 

the total number of frames. For convenience, each node label 
Cf(i,j) is vectorised into an N 2 dimensional vector lf(i,j). 

At each node location (i, j), a representative set 7Z(i,j) is 
maintained. It contains distinct labels that were obtained along 
its temporal line. Two labels are considered as distinct (visu- 
ally different), if they fail to adhere to one of the constraints 
described in Section III-B. Let these unique representative 
labels be denoted by r k (i, j) for k = 1, 2, • • • , S (with S < F), 
where r& denotes the mean of all the labels which were 
considered as similar to each other (mean of the cluster). 
Each label has an associated weight W k which denotes 

1 For implementation purposes, each block location and its instances at 
every frame are treated as a node and its labels, respectively. 



its number of occurrences in the sequence, i.e., the number 
of labels at location (i, j) which are deemed to be the same 
as Tk(i,j). For every such match, the corresponding Tk(i,j) 
and its associated variance, T> k (i,j) are updated recursively 
as given below: 
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, £p and r£ ew , ££ ew are the values of r k and its 
associated variance before and after the update respectively, 
and 1/ is the incoming label which matched Y°j} d . It is assumed 
that one element of 7£(i, j) corresponds to the background. 

In the second stage, representative sets 7Z(i,j) having 
just one label are used to initialise the corresponding node 
locations B(i,j) in the background B. 

In the third stage, the remainder of the background is esti- 
mated iteratively. An optimal labelling solution is calculated 
by considering the likelihood of each of its labels along with 
the a priori knowledge of the local spatial neighbourhood 
modelled as a MRF. Iterated conditional mode (ICM), a 
deterministic relaxation technique, performs the optimisation. 
The framework is described in detail in Section III-C. The 
strategy for selecting the location of an empty background 
node to initialise a label is described in Section III-D. The 
procedure for calculating the energy potentials, a prerequisite 
in determining the a priori probability, is described in Sec- 
tion III-E. 

The overall pseudo-code of the algorithm is given in Algo- 
rithm 1 and an example of the algorithm in action is shown 
in Figure 2. 

B. Similarity Criteria for Labels 

We assert that two labels j) and Y k {i^j) are similar if 
the following two constraints are satisfied: 
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Stage 1 : Collection of Label Representatives 

1) U <- (null set) 

2) for / = 1 to F do 

a) Split input frame If into node labels, each with a size of N x N. 

b) for each node label Cf(i,j) do 

i) Vectorise node Cf(i,j) into \f(i,j). 

ii) Find the representative label r m (i, j) from the set 
ft(z, j) = (r fc (i, < k < S), matching to \ f (i,j) 
based on conditions in Eqns. (3) and (4). 

if (lZ(i,j) = {0} or there is no match) then 
k <- fc + 1. 

Add a new representative label Tk(i,j) ^— lf(i,j) to set 7Z(i,j) and initialise its weight, Wfc(i, j), to 1. 
else 

Recursively update the matched label r m (z, j) and its variance 
given by Eqns. (1) and (2) respectively. 
W TO (i,j)^ W m (i,j) + 1 
end if 
end for each 
end for 



Stage 2: Partial Background Initialisation 

1) B<-0 

2) for each set 1Z(i : j) do 

if (siz&(K(i,j)) = 1) then 
j) <- ri (i, j) . 

end if 
end for each 



3: Estimation of the Remaining Background 

1) Full background initialisation 

while (23 not filled) do 

if B = and has neighbours as specified in Section III-D then 

B(ijj) <(— Tmax(i,j), the label out of set lZ(i,j) which yields maximum value of the posterior probability described in 
Eqn. (12) (see Section III-C). 
end if 
end while 

2) Application oflCM 
iteration_count <(— 

while (iteration_count < total iterations) do 
for each set 1Z(i,j) do 

if P(r new (iJ)) > P(r id(i,j)) then 

B ^— Tnew where P(-) is the posterior probability defined by Eqn. (12). 

end if 
end for each 

iteration_count — iteration_count + 1 
end while 



Algorithm 1. Pseudo-code for the proposed algorithm. A C++ implementation is available at http://arma.sourceforge.net/background_est/ 




Fig. 3. The local neighbourhood system and its four cliques. Each 
clique is comprised of 4 nodes (blocks). To demonstrate one of the 
cliques, the the top-left clique has dashed links. 

Equations (3) and (4), respectively, evaluate the correlation co- 
efficient and the mean of absolute differences (MAD) between 
the two labels, with the latter constraint ensuring that the labels 
are close in TV 2 dimensional space. n rk ,ni f and o rk ,G\ f are 
the mean and standard deviation of the elements of labels r& 
and 1/ respectively, while d k (ij) = If(hj) ~ *k{ij)- 

7i is selected empirically (see Section IV), to ensure that 
two visually identical labels are not treated as being different 
due to image noise. T2 is proportional to image noise and is 
found automatically as follows. Using a short training video, 
the MAD between co-located labels of successive frames is 
calculated. Let the number of frames be L and A/5 be the 
number of labels per frame. The total MAD points obtained 
will be (L — l)Nb. These points are sorted in ascending order 
and divided into quartiles. The points lying between quartiles 
Qs and Qi are considered. Their mean, /xq and standard 
deviation, <tq , are used to estimate T2 as 2 x (/iQ + 2ctq ). 
This ensures that low MAD values (close or equal to zero) 
and high MAD values (arising due to movement of objects) 
are ignored (i.e. treated as outliers). 

We note that both constraints (3) and (4) are necessary. As 
an example, two vectors [1, 2, • • • , 16] and [101, 102, • • • , 116] 
have a perfect correlation of 1 but their MAD will be higher 
than 7*2 • On the other hand, if a thin edge of the foreground 
object is contained in one of the labels, their MAD may be 
well within 72- However, Eqn. (3) will be low enough to 
indicate the dissimilarity of the labels. In contrast, we note 
that in [18] the similarity criteria is just based on the sum of 
squared distances between the two blocks. 

C. Markov Random Field (MRF) Framework 

Markov random field/probabilistic undirected graphical 
model theory provides a coherent way of modelling context- 
dependent entities such as pixels or edges of an image. It has 
a set of nodes, each of which corresponds to a variable or a 
group of variables, and set of links each of which connects 
a pair of nodes. In the field of image processing it has been 
widely employed to address many problems that can be mod- 
elled as labelling problem with contextual information [23], 
[24]. 

Let X be a 2D random field, where each random variate 
(Vhj) takes values in discrete state space A. Let uj £ 




(i) (ii) 

Fig. 4. (i) Three cliques each of which has an empty node. The 
gaps between the blocks are for ease of interpretation only, (ii) Same 
cliques where the empty node has been labelled. The constraint 
of 3 neighbouring nodes to be available in 3 different directions 
as illustrated ensures that arbitrary edge continuities are taken into 
account while assigning the label at the empty node. 



Q be a configuration of the variates in X, and let ft be the set 
of all such configurations. The joint probability distribution of 
X is considered Markov if 

p(x = u) > 0, v uj £ n (5) 

and 

P \ x ( P ,q) , (Uj) + (P, q)) = P \ x N (iJ) ) (6) 

where Xn ( . j} refers to the local neighbourhood system of 

Unfortunately, the theoretical factorisation of the joint prob- 
ability distribution of the MRF turns out to be intractable. To 
simplify and provide computationally efficient factorisation, 
Hammersley-Clifford theorem [25] states that an MRF can 
equivalently be characterised by a Gibbs distribution. Thus 

e -U(uj)/T 

P(X = u>) = ^ (7) 

where 

Z = Y e - u ^' T (8) 

is a normalisation constant known as the partition function, T 
is a constant used to moderate the peaks of the distribution and 
U(uj) is an energy function which is the sum of clique/energy 
potentials V c over all possible cliques C: 

c/ H = E ce c KM (9) 

The value of V c (uo) depends on the local configuration of 
clique c. 

In our framework, information from two disparate sources is 
combined using B ayes' rule. The local visual observations at 
each node to be labelled yield label likelihoods. The resulting 
label likelihoods are combined with a priori spatial knowledge 
of the neighbourhood represented as an MRF. 



Let each input image //be treated as a realisation of the 
random field B. For each node B(i,j), the representative set 
H{i,j) (see Section III- A) containing unique labels is treated 
as its state space with each r k (i,j) as its plausible label 2 . 

Using Bayes' rule, the posterior probability for every label 
at each node is derived from the a priori probabilities and the 
observation-dependent likelihoods given by 

P(r k ) = l(r k )p(r k ) (10) 

The product is comprised of likelihood l(r k ) of each label r k 
of set 1Z and its a priori probability density p(r k ), conditioned 
on its local neighbourhood. In the derivation of likelihood 
function it is assumed that at each node the observation 
components are conditionally independent and have the 
same known conditional density function dependent only on 
that node. 

At a given node, the label that yields maximum a posteriori 
(MAP) probability is chosen as the best continuation of the 
background at that node. 

To optimise the MRF-based function defined in Eqn. (10), 
ICM is used since it is computationally efficient and avoids 
large scale effects 3 [24]. ICM maximises local conditional 
probabilities iteratively until convergence is achieved. 

Typically, in ICM an initial estimate of the labels is ob- 
tained by maximising the likelihood function. However, in our 
framework an initial estimate consists of partial reconstruction 
of the background at nodes having just one label which is 
assumed to be the background. Using the available background 
information, the remaining unknown background is estimated 
progressively (see Section III-D). 

At every node, the likelihood of each of its labels r k (k = 
1, 2, • • • , S) is calculated using corresponding weights W k (see 
Section III- A). The higher the occurrences of a label, the more 
is its likelihood to be part of the background. Empirically, the 
likelihood function is modelled by a simple weighted function 
given by: 



where W Ck = mm(W rnax , W k ) and W rnax = 5 x frame rate 
of the captured sequence 4 . 

As evident, the weight W of a label greater than W max 
will be capped to W max . Setting a maximum threshold value 
is necessary in circumstances where the image sequence has 
a stationary foreground object visible for an exceedingly long 
period when compared to the background occluded by it. For 
example, in a 1000-frame sequence, a car might be parked for 
the first 950 frames and in the last 50 frames it drives away. In 
this scenario, without the cap the likelihood of the car being 
part of the background will be too high compared to the true 
background and this will bias the overall estimation process 
causing errors in the estimated background. 

2 To simplify the notations, index term has been henceforth omitted. 

3 An undesired characteristic where a single label wrongly gets assigned to 
most of the nodes of the random field. 

4 It is assumed that the likelihood of a label exposed for a duration of 
5 seconds is good enough to be regarded as a potential candidate for the 
background. 



Relying on this likelihood function alone is insufficient 
since it may still introduce estimation errors even when the 
foreground object is exposed for just slightly longer duration 
compared to the background. 

Hence, to overcome this limitation, the spatial neighbour- 
hood modelled as Gibbs distribution (given by Eqn. (7)) is 
encoded into an a priori probability density. The formulation 
of the clique potential V c {uS) referred in Eqn. (9) is described 
in the Section III-E. Using Eqns. (7), (8) and (9) the cal- 
culated clique potentials V c (uo) are transformed into a priori 
probabilities. For a given label, the smaller the value of energy 
function, the greater is its probability in being the best match 
with respect to its neighbours. 

In our evaluation of the posterior probability given by 
Eqn. (10), the local spatial context term is assigned more 
weight than the likelihood function which is just based on 
temporal statistics. Thus, taking log of Eqn. (10) and assigning 
a weight to the prior, we get: 

log (P(r k )) = log (Z(r*)) + V log (p(r fc )) (12) 

where rj has been empirically set to number of neighbouring 
nodes used in clique potential calculation (typically 77 = 3). 

The weight is required in order to address the scenario 
where the true background label is visible for a short interval 
of time when compared to labels containing the foreground. 
For example, in Figure 2, a sequence consisting of 450 frames 
was used to estimate its background. The person was standing 
as shown in Figure 2(i) for the first 350 frames and eventually 
walked off during the last 100 frames. The algorithm was able 
to estimate the background occluded by the standing person. It 
must be noted that pixel-level processing techniques are likely 
to fail in this case. 

D. Node Initialisation 

Nodes containing a single label in their representative set 
are directly initialised with that label in the background (see 
Figure 2(ii)). However, in some rare situations there is a 
possibility that all the sets may contain more than one label. In 
such a case, the algorithm heuristically picks the label having 
the largest weight W from the representative sets of the four 
corner nodes as an initial seed to initialise the background. 
It is assumed atleast one of the corner regions in the video 
frames corresponds to a static region. 

The rest of the nodes are initialised based on constraints as 
explained below. In our framework, the local neighbourhood 
system [23] of a node and the corresponding cliques are 
defined as shown in Figure 3. A clique is defined as a subset 
of the nodes in the neighbourhood system that are fully 
connected. The background at an empty node will be assigned 
only if at least 2 neighbouring nodes of its 4-connected 
neighbours adjacent to each other and the diagonal node 
located between them are already assigned with background 
labels. For instance, in Figure 3, we can assign a label to node 
X if at least nodes B, D (adjacent 4-connected neighbours) 
and A (diagonal node) have already been assigned with labels. 
In other words, label assignment at node X is conditionally 



independent of all other nodes given these 3 neighbouring 
nodes. 

Node X has nodes D, B, E and G as its 4-connected 
neighbours. Let us assume that all nodes except X are labelled. 
To label node X the procedure is as follows. In Figure 3, four 
cliques involving X exist. For each candidate label at node X, 
the energy potential for each of the four cliques is evaluated 
independently given by Eqn. (13) and summed together to 
obtain its energy value. The label that yields the least value is 
likely to be assigned as the background. 

Mandating that the background should be available in at 
least 3 neighbouring nodes located in three different directions 
with respect to node X ensures that the best match is obtained 
after evaluating the continuity of the pixels in all possible 
orientations. For example, in Figure 4, this constraint ensures 
that the edge orientations are well taken into account in the 
estimation process. It is evident from examples in Figure 4 
that using either horizontal or vertical neighbours alone can 
cause errors in background estimation (particularly at edges). 

Sometimes not all the three neighbours are available. In 
such cases, to assign a label at node X we use one of its 4- 
connected neighbours whose node has already been assigned 
with a label. Under these contexts, the clique is defined as two 
adjacent nodes either in the horizontal or vertical direction. 

Typically, after initialising all the empty nodes an accurate 
estimate of the background is obtained. Nonetheless, in certain 
circumstances an incorrect label assignment at a node may 
cause an error to occur and propagate to its neighbourhood. 
Our previous algorithm [10] is prone to this type of problem. 
However, in the current framework the problem is successfully 
redressed by the application of ICM. In subsequent iterations, 
in order to avoid redundant calculations, the label process 
is carried out only at nodes where a change in the label of 
one of their 8 -connected neighbours occurred in the previous 
iteration. 




E. Calculation of the Energy Potential 

In Figure 3, it is assumed that all nodes except X are 
assigned with the background labels. The algorithm needs 
to assign an optimal label at node X. Let node X have S 
labels in its state space 1Z for k = 1, 2, • • • , S where one 
of them represents the true background. Choosing the best 
label is accomplished by analysing the spectral response of 
every possible clique constituting the unknown node X. For 
the decomposition we chose the Discrete Cosine Transform 
(DCT) [26] due to its decorrelation properties as well as ease 
of implementation in hardware. The DCT coefficients were 
also utilised by Wang et al. [27] to segment moving objects 
from compressed videos. 

We consider the top left clique consisting of nodes A, B, 
D and X. Nodes A, B and C are assigned with background 
labels. Node X is assigned with one of S candidate labels. 
We take the 2D DCT of the resulting clique. The transform 
coefficients are stored in matrix of size M x M (M = 27V) 
with its elements referred to as Ck(v,u). The term (7^(0,0) 
(reflecting the sum of pixels at each node) is forced to since 
we are interested in analysing the spatial variations of pixel 
values. 




Fig. 5. An example of the processing done in Section III-E. 
(i) A clique involving empty node X with four candidate labels in its 
representative set. (ii) A clique and a graphical representation of its 
DCT coefficient matrix where node X is initialised with candidate 
label 1. The gaps between the blocks are for ease of interpretation 
only and are not present during DCT calculation, (iii) As per (ii), but 
using candidate label 2. (iv) As per (ii), but using candidate label 3. 
(v) As per (ii), but using candidate label 4. The smoother spectral 
distribution for candidate 3 suggests that it is a better fit than the 
other candidates. 



Similarly, for other labels present in the state space of node 
X, we compute their corresponding 2D DCT as mentioned 
above. A graphical example of the procedure is shown in 
Figure 5. 

Assuming that pixels close together have similar intensities, 
When the correct label is placed at node X, the resulting 
transformation has a smooth response (less high frequency 
components) when compared to other candidate labels. 

The higher-order components typically correspond to high 
frequency image noise. Hence, in our energy potential calcu- 
lation defined below we consider only the lower 75% of the 
frequency components after performing a zig-zag scan from 
the origin. 

The energy potential for each label is calculated using: 

= \C k (v,u)\) (13) 

where P = ceil (VM 2 x 0.75) and uok is the local configu- 
ration involving label k. Similarly, the potentials over other 
three cliques in Figure 3 are calculated. 

IV. Experiments 

In our experiments the testing was limited to greyscale 
sequences. The size of each node was set to 16 x 16. The 
threshold T\ was empirically set to 0.8 based on prelimi- 
nary experiments, discussed in subsection IV- A3. T2 (found 
automatically) was found to vary between 1 and 4 when 
tested on several image sequences (Ti and T2 are described in 
Section III-B). 

A prototype of the algorithm using Matlab on a 1.6 GHz 
dual core processor yielded 17 fps. We expect that consider- 
ably higher performance can be attained by converting the 
implementation to C++, with the aid of libraries such as 
OpenCV [28] or Armadillo [29]. To emphasise the effective- 
ness of our approach, the estimated backgrounds were obtained 
by labelling all the nodes just once (no subsequent iterations 
were performed). 

We conducted two separate set of experiments to verify the 
performance of the proposed method. In the first case, we 
measured the quality of the estimated backgrounds, while in 
the second case we evaluated the influence of the proposed 
method on a foreground segmentation algorithm. Details of 
both the experiments are described in Sections IV- A and IV-B, 
respectively. 

A. Standalone Performance 

We compared the proposed algorithm with a median filter 
based approach (i.e. applying filter on pixels at each location 
across all the frames) as well as finding intervals of stable 
intensity (ISI) method presented in [14]. We used a total of 
20 surveillance videos: 7 obtained from CAVIAR dataset 5 , 
3 sequences from the abandoned object dataset used in the 
CANDELA project 6 and 10 unscripted sequences obtained 
from a railway station in Brisbane. The CAVIAR and and 

5 http://groups.inf.ed.ac.uk/vision/CAVIAR/CAVIARDATAl/ 
6 http ://www. multitel .be/ ~ va/ candela/ 



CANDELA sequences were chosen based on four criteria: (i) 
a minimum duration of 700 frames, (ii) containing significant 
background occlusions, (iii) the true background is available 
in at least one frame, and (iv) have largely static backgrounds. 
Having the true background allows for quantitative evaluation 
of the accuracy of background estimation. The sequences were 
resized to 320 x 240 pixels (QVGA resolution) in keeping with 
the resolution typically used in the literature. 

The algorithms were subjected to both qualitative and quan- 
titative evaluations. Subsections IV- A 1 and IV- A2 respectively 
describe the experiments for both cases. Sensitivity of 7i is 
studied in subsection IV- A3. 

1 ) Qualitative Evaluation: All 20 sequences were used for 
subjective evaluation of the quality of background estimation. 
Figure 6 shows example results on four sequences with dif- 
fering complexities. 

Going row by row, the first and second sequences are from 
a railway station in Brisbane, the third is from the CANDELA 
dataset and the last is from the CAVIAR dataset. In the first 
sequence, several commuters wait for a train, slowly moving 
around the platform. In the second sequence, two people 
(security guards) are standing on the platform for most of 
the time. In the third sequence, a person places a bag on the 
couch, abandons it and walks away. Later, the bag is picked up 
by another person. The bag is in the scene for about 80% of 
the time. In the last sequence two people converse for most of 
the time while others slowly walk along the corridor. All four 
sequences have foreground objects that are either dynamic or 
quasi-stationary for most of the time. 

It can be observed that the estimated backgrounds obtained 
from median filtering (second column) and the ISI method 
(third column) have traces of foreground objects that were 
stationary for a relatively long time. The results of the 
proposed method appear in the fourth column and indicate 
visual improvements over the other two techniques. It must 
be noted that stationary objects can appear as background to 
the proposed algorithm, as indicated in the first row of the 
fourth column. Here a person is standing at the far end of the 
platform for the entire sequence. 

2) Quantitative Evaluation: To objectively evaluate the 
quality of the estimated backgrounds we considered the test 
criteria described in [19], where the average grey-level error 
(AGE), total number of error pixels (EPs) and the number of 
"clustered" error pixels (CEPs) are used. AGE is the average 
of the difference between the true and estimated backgrounds. 
If the difference between estimated and true background pixel 
is greater than a threshold, then it is classified as an EP. We 
set the threshold to 20, to ensure good quality backgrounds. 
A CEP is defined as any error pixel whose 4-connected 
neighbours are also error pixels. As our method is based on 
region-level processing we calculated only the AGE and CEPs. 

The Brisbane railway station sequences were not used as 
their true background was unavailable. The remaining 10 
image sequences were used as listed in Table I. To maintain 
uniformity across sequences, the experiments were conducted 
using the first 700 frames from each sequence. The back- 
ground was estimated in three cases. In the first case, all 
700 frames (100%) were used to estimate the background. To 




Fig. 6. (i) Example frames from four videos, and the reconstructed background using: (ii) median filter, (iii) ISI method [14], (iv) proposed 
method. 



evaluate the quality when less frames are available (e.g. the 
background needs to be updated more often), in the second 
case the sequences were split into halves of 350 frames 
(50%) each. Each sub-sequence was used independently for 
background estimation and the obtained results were averaged. 
In the third case each sub- sequence was further split into 
halves (i.e., 25% of the total length). Further division of 
the input resulted in sub- sequences in which parts of the 
background were always occluded and hence were not utilised. 
The averaged AGE and CEP values in all three cases are 
graphically illustrated in Figure 7 and tabulated in Tables I 
and II. The visual results in Figure 6 confirm the objective 
results, with the proposed method producing better quality 
backgrounds than the median filter approach and the ISI 
method. 
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Percentage of input frames used Percentage of input frames used 

(i) (ii) 
Fig. 7. Averaged values of AGE (i) and CEPs (ii) obtained by using 100%, 50% and 25% of the input sequences. 
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TABLE I 

Averaged grey-level error (AGE) results from experiments on 10 image sequences. The results under case 2 and 

CASE 3 (USING 50% AND 25% OF THE INPUT SEQUENCE, RESPECTIVELY) WERE OBTAINED BY AVERAGING OVER THE TWO AND FOUR 

SUB-SEQUENCES RESPECTIVELY. 
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TABLE II 

As per Table I, but using clustered error pixels (CEPs) as the error measure. 
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Fig. 8. Effect of 7i on AGE, while using a fixed value of Ti- 
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Average 


16.67% 



TABLE III 

Relative percentage improvement in foreground 

SEGMENTATION similarity (EQN. 14), OBTAINED ON THE 

Wallflower dataset, resulting from the use of the 
mrf-based parameter estimation in comparison to 
direct parameter estimation. the similarity value of 

moved object SEQUENCE TURNS OUT TO BE ZERO (DUE TO THE 
ABSENCE OF TRUE POSITIVES IN ITS GROUND-TRUTH) AND IS 
THEREFORE NOT LISTED. 



3) Sensitivity of T\: To find the optimum value of 71, we 
chose a random set of sequences from the CAVIAR dataset, 
whose true background was available a-priori and computed 
the averaged AGE between the true and estimated backgrounds 
for various values of T\ as indicated in Figure 8. As shown, 
the optimum value (minimum error) was obtained at 7i =0.8. 

B. Evaluation by Foreground Segmentation 

In order to show the proposed method aids in better seg- 
mentation results, we objectively evaluated the performance 
of a segmentation algorithm (via background subtraction) on 
the Wallflower dataset. We note that the proposed method 
is primarily designed to deal with static backgrounds, while 
Wallflower contains both static and dynamic backgrounds. As 
such, Wallflower might not not optimal for evaluating the 
efficacy of the proposed algorithm in its intended domain, 
however it can nevertheless be used to provide some suggestive 
results as to the performance in various conditions. 

For foreground object segmentation estimation, we use a 
Gaussian based background subtraction method where each 
background pixel is modeled using a Gaussian distribution. 
The parameters of each Gaussian (i.e., the mean and variance) 
are initialised either directly from a training sequence, or via 
the proposed MRF-based background estimation method (i.e. 
using labels yielding the maximum value of the posterior 
probability described in Eqn. (12) and their corresponding 
variances, respectively). The median filter and ISI [14] meth- 
ods were not used since they do not define how to compute 
pixel variances of their estimated background. 

For measurement of foreground segmentation accuracy, 
we use the similarity measure adopted by Maddalena and 
Petrosino [30], which quantifies how similar the obtained 
foreground mask is to the ground- truth. The measure is defined 
as: 

tp 

similarity = — (14) 

tp + fp + fn 

where similarity e [0,1], while tp, fp and fn are total number 
of true positives, false positives and false negatives (in terms 
of pixels), respectively. The higher the similarity value, the 



better the segmentation result. We note that the similiarity 
measure is related to precision and recall metrics [31]. 

The parameter settings were the same as used for measur- 
ing the standalone performance (Section IV- A). The relative 
improvements in similarity resulting from the use of the MRF- 
based parameter estimation in comparison to direct parameter 
estimation are listed in Table III. 

We note that each of the Wallflower sequences addresses 
one specific problem, such as dynamic background, sudden 
and gradual illumination variations, camouflage, and boot- 
strapping. As mentioned earlier, the proposed method is pri- 
marily designed for static background estimation (bootstrap- 
ping). On the 'Bootstrap' sequence, characterised by severe 
background occlusion we register a significant improvement 
of over 62%. On the other sequences, the results are only 
suggestive and need not always yield high similarity values. 
For example, we note a degradation in the performance on 
'TimeOfDay' sequence. In this sequence, there is steady 
increase in the lighting intensity from dark to bright, due to 
which identical labels were falsely treated as 'unique'. As a 
result, estimated background labels variance appeared to be 
smaller than the true variance of the background, which in 
turn resulted in surplus false positives. Overall, MRF based 
background initialisation over 6 sequences achieved an average 
percentage improvement in similarity value of 16.67%. 

C. Additional Observations 

We noticed (via subjective observations) that all background 
estimation algorithms perform reasonably well when fore- 
ground objects are always in motion (i.e., in cases where the 
background is visible for a longer duration when compared 
to the foreground). In such circumstances, a median filter 
is perhaps sufficient to reliably estimate the background. 
However, accurate estimation by the median filter and the 
ISI method becomes problematic if the above condition is not 
satisfied. This is the main area where the proposed algorithm 
is able to estimate the background with considerably better 
quality. 



The proposed algorithm sometimes mis-estimates the back- 
ground in cases where the true background is characterised by 
strong edges while the occluding foreground object is smooth 
(uniform intensity value) and has intensity value similar to that 
of the background (i.e., low contrast between the foreground 
and the background). Under these conditions, the energy 
potential of the label containing the foreground object is 
smaller (i.e., smoother spectral response) than that of the label 
corresponding to the true background. 

From our experiments we found the memory footprint to 
store the state space of all the nodes is on average only 5% 
of the memory required for storing all the frames. This is 
in contrast to existing algorithms, which typically require the 
storage of all the frames before processing can begin. 

We conducted additionally experiments on image sequences 
represented in other colour spaces, such as RGB and YUV, and 
evaluated the overall posterior as the sum of individual pos- 
teriors evaluated on each channel independently. The results 
were marginally better than those obtained using greyscale 
input. We conjecture that this is because the spatial continuity 
of structures within a scene are well represented in greyscale. 

V. Main Findings and Future Work 

In this paper we proposed a background estimation al- 
gorithm in an MRF framework that is able to accurately 
estimate the static background from cluttered surveillance 
videos containing image noise as well as foreground objects. 
The objects may not always be in motion or may occlude the 
background for much of the time. 

The contributions include the way we define the neighbour- 
hood system, the cliques and the formulation of clique poten- 
tial which characterises the spatial continuity by analysing data 
in the spectral domain. Furthermore, the proposed algorithm 
has several advantages, such as computational efficiency and 
low memory requirements due to sequential processing of 
frames. This makes the algorithm possibly suitable for imple- 
mentation on embedded systems, such as smart cameras [32], 
[1]. 

The performance of the algorithm is invariant to moderate 
illumination changes, as we consider only AC coefficients of 
the DCT in the computation of the energy potential defined by 
Eqn. (13). However, the similarity criteria defined by Eqns. (3) 
and (4) creates multiple representatives for the same visually 
identical block. Tackling this problem efficiently is part of 
further research. We also intend to extend this work to estimate 
background models of non- static backgrounds. 

Experiments on real-life surveillance videos indicate that the 
algorithm obtains considerably better background estimates 
(both objectively and subjectively) than methods based on me- 
dian filtering and finding intervals of stable intensity. Further- 
more, segmentation of foreground objects on the Wallflower 
dataset was also improved when the proposed method was 
used to initialise the background model based on a single 
Gaussian. We note that the proposed background estimation 
algorithm can be combined with almost any foreground seg- 
mentation technique, such as [8], [33]. 
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